Subvert insert_overwrite merge strategy to bring back merge #371
Resolves #16
Description
For comment only to see if there's any interest in the idea, not necessarily this particular implementation.
The insert overwrite implementation uses BigQuery scripting variables to extract and store the partition values that will be updated, allowing efficient pruning of the insert query with substantial cost and performance benefits. This works well when those values are small enough to fit inside BQ scripting's memory limit of 1 MiB, as described in #16, which I guess is going to be hit at 16,384 INT64 IDs.
Prior to the current insert overwrite strategy, @jtcohen6 worked on a similar approach in dbt-labs/dbt-core#1971 that used the minimum and maximum partition values instead. This works around the memory limit because only two values are selected. It's not as efficient (unnecessary partitions between the min and the max will still be read), but in the worst case it performs no better than the merge + temp table approach (in which case users can opt for the merge strategy), and in the best case (a contiguous block of IDs, e.g. in append-only data) it prunes partitions as well as insert overwrite does.
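To make the trade-off concrete, here is a toy model in Python. It is only a sketch and not the adapter's code: the function names and the integer partition IDs are made up for illustration, and it counts partitions read rather than bytes billed.

```python
def partitions_read_exact(changed, all_partitions):
    """Partitions scanned when pruning on the exact set of changed values
    (the idea behind insert overwrite's scripting variable)."""
    return sorted(set(changed) & set(all_partitions))


def partitions_read_min_max(changed, all_partitions):
    """Partitions scanned when pruning only on the min and max changed values
    (the idea behind the min/max variant)."""
    if not changed:
        return []
    lo, hi = min(changed), max(changed)
    return sorted(p for p in set(all_partitions) if lo <= p <= hi)


# Hypothetical table with 100 integer partitions.
all_partitions = range(1, 101)

# Best case: a contiguous block, e.g. append-only data.
contiguous = [98, 99, 100]
# Worst case: changed values spread across the whole table.
scattered = [1, 50, 100]

for name, changed in [("contiguous", contiguous), ("scattered", scattered)]:
    exact = len(partitions_read_exact(changed, all_partitions))
    ranged = len(partitions_read_min_max(changed, all_partitions))
    print(f"{name}: exact set reads {exact}, min/max reads {ranged}")
```

In the contiguous case both approaches read the same three partitions; in the scattered case the min/max range covers the whole table, which is where falling back to the plain merge strategy makes sense.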
This PR just tinkers with the insert overwrite strategy to bring back this approach as an optional variant. It's probably confusing UX to have this very-much-not-an-insert-overwrite inside the "insert overwrite" strategy, and it has no tests. I made it to confirm that it was possible and that it worked in practice, not just in my head, and it did offer significant improvements for me.
If there's interest in bringing this back I'm happy to work on an improved version that addresses these and any other issues raised.
Alternatives Considered
Users can also implement this themselves (to a large extent, at least; I had SQL header issues) by implementing their own `bigquery__get_merge_sql`. I added one in #16 (comment) as an example, which is working for me although, as mentioned, I had SQL header issues: they're output to build the temp table and I can't seem to suppress them, so they're output again for the real select. It would be more convenient, simpler, and less error-prone if this were supported directly.
Users can also fall back to the existing merge approach, but this can be much more expensive.
Checklist
`changie new` to create a changelog entry